Sales forecasting¶

In this project we going to establish a model for forecasting sales of different items from different stores. Our dataset contains the sales data for 10 stores and 50 items.

Data set:¶

Data fields date - Date of the sale data. There are no holiday effects or store closures.

store - Store ID

item - Item ID

sales - Number of items sold at a particular store on a particular date.

Data Source¶

https://www.kaggle.com/competitions/demand-forecasting-kernels-only/data

In [1]:
## Change the working directory
import os
os.chdir("E:/Time series/Fine Grained Demand Forecasting")
print(os.getcwd())
E:\Time series\Fine Grained Demand Forecasting
In [2]:
## getting the dataset
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
Out[2]:
date store item sales
0 2013-01-01 1 1 13
1 2013-01-02 1 1 11
2 2013-01-03 1 1 14
3 2013-01-04 1 1 13
4 2013-01-05 1 1 10
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   date    913000 non-null  object
 1   store   913000 non-null  int64 
 2   item    913000 non-null  int64 
 3   sales   913000 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 27.9+ MB
In [4]:
## Sorting the dataset according to store number and item number
df_sorted = df.sort_values(by=['store', 'item'], ascending=[True, True])
df_sorted
Out[4]:
date store item sales
0 2013-01-01 1 1 13
1 2013-01-02 1 1 11
2 2013-01-03 1 1 14
3 2013-01-04 1 1 13
4 2013-01-05 1 1 10
... ... ... ... ...
912995 2017-12-27 10 50 63
912996 2017-12-28 10 50 59
912997 2017-12-29 10 50 74
912998 2017-12-30 10 50 62
912999 2017-12-31 10 50 82

913000 rows × 4 columns

In [5]:
## Create dataframe for store 1
df_1=df_sorted.iloc[:91300]
df_1
Out[5]:
date store item sales
0 2013-01-01 1 1 13
1 2013-01-02 1 1 11
2 2013-01-03 1 1 14
3 2013-01-04 1 1 13
4 2013-01-05 1 1 10
... ... ... ... ...
896561 2017-12-27 1 50 38
896562 2017-12-28 1 50 52
896563 2017-12-29 1 50 59
896564 2017-12-30 1 50 66
896565 2017-12-31 1 50 45

91300 rows × 4 columns

In [6]:
# Pivot without aggregation ( this will put the sales for each items in columns to create a multivariable time series)
pivot_df_1 = df_1.pivot( index='date',columns='item', values='sales')
## Changing column names to item_1,..
columnnames = {}
count = 0
for i in pivot_df_1.columns:

  count += 1

  columnnames[i] = f"store1_item_{count}"

#columnnames
pivot_df_1.rename(columns = columnnames ,inplace = True)
# Rename the dataframe to df_1
df_1=pivot_df_1
df_1.head()  ## date is already in index.
Out[6]:
item store1_item_1 store1_item_2 store1_item_3 store1_item_4 store1_item_5 store1_item_6 store1_item_7 store1_item_8 store1_item_9 store1_item_10 ... store1_item_41 store1_item_42 store1_item_43 store1_item_44 store1_item_45 store1_item_46 store1_item_47 store1_item_48 store1_item_49 store1_item_50
date
2013-01-01 13 33 15 10 11 31 25 33 18 37 ... 6 21 22 20 37 30 17 21 18 30
2013-01-02 11 43 30 11 6 36 23 37 23 34 ... 15 24 27 15 40 30 15 26 10 32
2013-01-03 14 23 14 8 8 18 34 38 25 32 ... 5 14 19 11 42 30 5 25 17 25
2013-01-04 13 18 10 19 9 19 36 54 22 45 ... 9 22 29 22 49 37 13 26 22 32
2013-01-05 10 34 23 12 8 31 38 51 29 35 ... 13 18 34 19 52 28 12 28 15 35

5 rows × 50 columns

In [7]:
## Changing the index to datetime 
df_1.index = pd.to_datetime(df_1.index)
df_1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1826 entries, 2013-01-01 to 2017-12-31
Data columns (total 50 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   store1_item_1   1826 non-null   int64
 1   store1_item_2   1826 non-null   int64
 2   store1_item_3   1826 non-null   int64
 3   store1_item_4   1826 non-null   int64
 4   store1_item_5   1826 non-null   int64
 5   store1_item_6   1826 non-null   int64
 6   store1_item_7   1826 non-null   int64
 7   store1_item_8   1826 non-null   int64
 8   store1_item_9   1826 non-null   int64
 9   store1_item_10  1826 non-null   int64
 10  store1_item_11  1826 non-null   int64
 11  store1_item_12  1826 non-null   int64
 12  store1_item_13  1826 non-null   int64
 13  store1_item_14  1826 non-null   int64
 14  store1_item_15  1826 non-null   int64
 15  store1_item_16  1826 non-null   int64
 16  store1_item_17  1826 non-null   int64
 17  store1_item_18  1826 non-null   int64
 18  store1_item_19  1826 non-null   int64
 19  store1_item_20  1826 non-null   int64
 20  store1_item_21  1826 non-null   int64
 21  store1_item_22  1826 non-null   int64
 22  store1_item_23  1826 non-null   int64
 23  store1_item_24  1826 non-null   int64
 24  store1_item_25  1826 non-null   int64
 25  store1_item_26  1826 non-null   int64
 26  store1_item_27  1826 non-null   int64
 27  store1_item_28  1826 non-null   int64
 28  store1_item_29  1826 non-null   int64
 29  store1_item_30  1826 non-null   int64
 30  store1_item_31  1826 non-null   int64
 31  store1_item_32  1826 non-null   int64
 32  store1_item_33  1826 non-null   int64
 33  store1_item_34  1826 non-null   int64
 34  store1_item_35  1826 non-null   int64
 35  store1_item_36  1826 non-null   int64
 36  store1_item_37  1826 non-null   int64
 37  store1_item_38  1826 non-null   int64
 38  store1_item_39  1826 non-null   int64
 39  store1_item_40  1826 non-null   int64
 40  store1_item_41  1826 non-null   int64
 41  store1_item_42  1826 non-null   int64
 42  store1_item_43  1826 non-null   int64
 43  store1_item_44  1826 non-null   int64
 44  store1_item_45  1826 non-null   int64
 45  store1_item_46  1826 non-null   int64
 46  store1_item_47  1826 non-null   int64
 47  store1_item_48  1826 non-null   int64
 48  store1_item_49  1826 non-null   int64
 49  store1_item_50  1826 non-null   int64
dtypes: int64(50)
memory usage: 727.5 KB
In [8]:
# Checking the missing values
df_1.isna().sum()
Out[8]:
item
store1_item_1     0
store1_item_2     0
store1_item_3     0
store1_item_4     0
store1_item_5     0
store1_item_6     0
store1_item_7     0
store1_item_8     0
store1_item_9     0
store1_item_10    0
store1_item_11    0
store1_item_12    0
store1_item_13    0
store1_item_14    0
store1_item_15    0
store1_item_16    0
store1_item_17    0
store1_item_18    0
store1_item_19    0
store1_item_20    0
store1_item_21    0
store1_item_22    0
store1_item_23    0
store1_item_24    0
store1_item_25    0
store1_item_26    0
store1_item_27    0
store1_item_28    0
store1_item_29    0
store1_item_30    0
store1_item_31    0
store1_item_32    0
store1_item_33    0
store1_item_34    0
store1_item_35    0
store1_item_36    0
store1_item_37    0
store1_item_38    0
store1_item_39    0
store1_item_40    0
store1_item_41    0
store1_item_42    0
store1_item_43    0
store1_item_44    0
store1_item_45    0
store1_item_46    0
store1_item_47    0
store1_item_48    0
store1_item_49    0
store1_item_50    0
dtype: int64
In [9]:
print(df_1.index.min())
print(df_1.index.max())
2013-01-01 00:00:00
2017-12-31 00:00:00
In [10]:
df_1["store1_item_1"].max()
Out[10]:
50

Plotting Time Series¶

We are creating time series plot for only first five items. Putting all of 50 grpahs together will be messy.

In [11]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Initialize the figure
fig = go.Figure()

# Define number of rows and columns for the subplots
rows, cols = 5, 1 # 10x5 grid

# Create subplots
fig = make_subplots(rows=rows, cols=cols, shared_xaxes=False, shared_yaxes=False,
                    vertical_spacing=0.1, horizontal_spacing=0.1,
                    subplot_titles=df_1.columns[0:5])
# Add each variable to a separate subplot
for i, column in enumerate(df_1.columns[0:5]):
    row = (i // cols) + 1
    col = (i % cols) + 1
    fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=column), row=row, col=col)

# Customize layout
fig.update_layout(
    title="Multivariable Time Series with Grid Subplots",
    xaxis_title="Date",
    height=1300,  # Adjust height for overall layout
    showlegend=False,
    template="plotly_dark"
)
# Customize y-axis titles for each subplot
for i, column in enumerate(df_1.columns[0:5], start=1):
    fig['layout'][f'yaxis{i}']['title'] = column

# Show the figure
fig.show()

Seasonal Decomposition¶

Seasonal decomposition separates a time series into three components: Trend, Seasonal, and Residual (Error). For multiple time series variables, this can be done by decomposing each variable individually and then plotting these components side by side in Plotly.

To perform seasonal decomposition in Python, you can use the seasonal_decompose function from the statsmodels library, and then visualize each component in a Plotly subplot. For seasonal decomposition, we have considered periods weekly, monthly and quarterly.

In [12]:
## Seasonal Decomposition with weekly period

from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}

# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
    decomposed = seasonal_decompose(df_1[column], model='additive', period=7) 
    decomposed_data[column] = decomposed

# Set up the subplot grid
rows = 5 * 3  # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1  # Only one column

# Create subplots
fig = make_subplots(
    rows=rows, cols=cols, shared_xaxes=False,
    vertical_spacing=0.02,
    subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)

# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
    decomposed = decomposed_data[column]
    
    # Calculate the starting row for each variable
    start_row = var_idx * 3 + 1
    
    # Original time series
    #fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
    
    
    # Trend component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
                  row=start_row, col=1)
    
    # Seasonal component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
                  row=start_row + 1, col=1)
    
    # Residual component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
                  row=start_row + 2, col=1)

# Customize layout
fig.update_layout(
    title="Seasonal Decomposition of Multiple Items of Store 1",
    height=700 * len(df_1.columns[ : 5]),  # Adjust height based on number of variables
    template="plotly_dark",
    showlegend=False
)

# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
    start_row = var_idx * 3 + 1
    #fig.update_yaxes(title_text="column", row=start_row, col=1)
    fig.update_yaxes(title_text="Trend", row=start_row, col=1)
    fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
    fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)

# Show the figure
fig.show()
In [22]:
decomposed.seasonal
Out[22]:
date
2013-01-01   -1.374647
2013-01-02   -1.078493
2013-01-03    0.411617
2013-01-04    0.686892
2013-01-05    1.922057
                ...   
2017-12-27   -1.078493
2017-12-28    0.411617
2017-12-29    0.686892
2017-12-30    1.922057
2017-12-31    2.991287
Name: seasonal, Length: 1826, dtype: float64
In [23]:
decomposed.trend
Out[23]:
date
2013-01-01          NaN
2013-01-02          NaN
2013-01-03          NaN
2013-01-04     9.428571
2013-01-05     9.285714
                ...    
2017-12-27    13.857143
2017-12-28    14.142857
2017-12-29          NaN
2017-12-30          NaN
2017-12-31          NaN
Name: trend, Length: 1826, dtype: float64
In [24]:
decomposed.resid
Out[24]:
date
2013-01-01         NaN
2013-01-02         NaN
2013-01-03         NaN
2013-01-04   -1.115463
2013-01-05   -3.207771
                ...   
2017-12-27   -6.778650
2017-12-28    0.445526
2017-12-29         NaN
2017-12-30         NaN
2017-12-31         NaN
Name: resid, Length: 1826, dtype: float64
In [ ]:
 
In [25]:
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
Out[25]:
seasonal trend resid actual_values
date
2013-01-01 -1.374647 NaN NaN 11.0
2013-01-02 -1.078493 NaN NaN 6.0
2013-01-03 0.411617 NaN NaN 8.0
2013-01-04 0.686892 9.428571 -1.115463 9.0
2013-01-05 1.922057 9.285714 -3.207771 8.0
2013-01-06 2.991287 9.428571 0.580141 13.0
2013-01-07 -3.558713 9.428571 5.130141 11.0
2013-01-08 -1.374647 9.571429 1.803218 10.0
2013-01-09 -1.078493 9.571429 -1.492936 7.0
2013-01-10 0.411617 9.285714 -1.697331 8.0
2013-01-11 0.686892 8.428571 0.884537 10.0
2013-01-12 1.922057 8.428571 -2.350628 8.0
2013-01-13 2.991287 8.285714 -0.277002 11.0
2013-01-14 -3.558713 8.142857 0.415856 5.0
2013-01-15 -1.374647 8.142857 3.231790 10.0
2013-01-16 -1.078493 8.142857 -1.064364 6.0
2013-01-17 0.411617 7.857143 -1.268760 7.0
2013-01-18 0.686892 7.714286 1.598823 10.0
2013-01-19 1.922057 7.571429 -1.493485 8.0
2013-01-20 2.991287 7.571429 -1.562716 9.0
2013-01-21 -3.558713 8.285714 -0.727002 4.0
2013-01-22 -1.374647 7.571429 2.803218 9.0
2013-01-23 -1.078493 7.428571 -0.350078 6.0
2013-01-24 0.411617 8.714286 2.874097 12.0
2013-01-25 0.686892 9.428571 -5.115463 5.0
2013-01-26 1.922057 9.428571 -4.350628 7.0
2013-01-27 2.991287 9.857143 5.151570 18.0
2013-01-28 -3.558713 10.000000 2.558713 9.0
2013-01-29 -1.374647 10.428571 -0.053925 9.0
2013-01-30 -1.078493 11.714286 -1.635793 9.0
2013-01-31 0.411617 10.714286 1.874097 13.0
2013-02-01 0.686892 10.571429 -3.258320 8.0
2013-02-02 1.922057 11.428571 2.649372 16.0
2013-02-03 2.991287 11.142857 -3.134144 11.0
2013-02-04 -3.558713 10.714286 0.844427 8.0
2013-02-05 -1.374647 10.571429 5.803218 15.0
2013-02-06 -1.078493 9.428571 -1.350078 7.0
2013-02-07 0.411617 9.285714 0.302669 10.0
2013-02-08 0.686892 10.000000 -3.686892 7.0
2013-02-09 1.922057 9.714286 -3.636342 8.0
2013-02-10 2.991287 9.857143 -2.848430 10.0
2013-02-11 -3.558713 9.857143 6.701570 13.0
2013-02-12 -1.374647 10.428571 3.946075 13.0
2013-02-13 -1.078493 10.714286 -1.635793 8.0
2013-02-14 0.411617 10.857143 -1.268760 10.0
2013-02-15 0.686892 9.714286 0.598823 11.0
2013-02-16 1.922057 9.285714 -1.207771 10.0
2013-02-17 2.991287 9.142857 -1.134144 11.0
2013-02-18 -3.558713 9.285714 -0.727002 5.0
2013-02-19 -1.374647 9.857143 1.517504 10.0
In [13]:
## Seasonal Decomposition with monthly period

from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}

# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
    decomposed = seasonal_decompose(df_1[column], model='additive', period=31) 
    decomposed_data[column] = decomposed

# Set up the subplot grid
rows = 5 * 3  # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1  # Only one column

# Create subplots
fig = make_subplots(
    rows=rows, cols=cols, shared_xaxes=False,
    vertical_spacing=0.02,
    subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)

# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
    decomposed = decomposed_data[column]
    
    # Calculate the starting row for each variable
    start_row = var_idx * 3 + 1
    
    # Original time series
    #fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
    
    
    # Trend component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
                  row=start_row, col=1)
    
    # Seasonal component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
                  row=start_row + 1, col=1)
    
    # Residual component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
                  row=start_row + 2, col=1)

# Customize layout
fig.update_layout(
    title="Seasonal Decomposition of Multiple Items of Store 1",
    height=700 * len(df_1.columns[ : 5]),  # Adjust height based on number of variables
    template="plotly_dark",
    showlegend=False
)

# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
    start_row = var_idx * 3 + 1
    #fig.update_yaxes(title_text="column", row=start_row, col=1)
    fig.update_yaxes(title_text="Trend", row=start_row, col=1)
    fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
    fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)

# Show the figure
fig.show()
In [27]:
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
Out[27]:
seasonal trend resid actual_values
date
2013-01-01 0.000462 NaN NaN 11.0
2013-01-02 -0.553486 NaN NaN 6.0
2013-01-03 0.340841 NaN NaN 8.0
2013-01-04 -0.675288 NaN NaN 9.0
2013-01-05 0.474322 NaN NaN 8.0
2013-01-06 0.002687 NaN NaN 13.0
2013-01-07 -0.309326 NaN NaN 11.0
2013-01-08 0.222932 NaN NaN 10.0
2013-01-09 -0.015110 NaN NaN 7.0
2013-01-10 -0.589638 NaN NaN 8.0
2013-01-11 -0.284855 NaN NaN 10.0
2013-01-12 -0.321562 NaN NaN 8.0
2013-01-13 0.191230 NaN NaN 11.0
2013-01-14 0.368971 NaN NaN 5.0
2013-01-15 -0.144894 NaN NaN 10.0
2013-01-16 0.717926 8.903226 -3.621152 6.0
2013-01-17 0.803021 8.806452 -2.609472 7.0
2013-01-18 -0.530127 9.129032 1.401095 10.0
2013-01-19 1.103911 9.225806 -2.329717 8.0
2013-01-20 -0.275956 9.193548 0.082407 9.0
2013-01-21 -0.033464 9.419355 -5.385891 4.0
2013-01-22 0.241842 9.225806 -0.467648 9.0
2013-01-23 0.237948 9.193548 -3.431497 6.0
2013-01-24 0.530496 9.096774 2.372730 12.0
2013-01-25 -0.748703 9.129032 -3.380329 5.0
2013-01-26 -0.460049 9.193548 -1.733499 7.0
2013-01-27 -0.114666 9.290323 8.824343 18.0
2013-01-28 0.376992 9.451613 -0.828605 9.0
2013-01-29 -0.698648 9.354839 0.343809 9.0
2013-01-30 -0.066835 9.516129 -0.449294 9.0
2013-01-31 0.209027 9.548387 3.242585 13.0
2013-02-01 0.000462 9.677419 -1.677882 8.0
2013-02-02 -0.553486 9.806452 6.747035 16.0
2013-02-03 0.340841 9.645161 1.013998 11.0
2013-02-04 -0.675288 9.709677 -1.034389 8.0
2013-02-05 0.474322 9.645161 4.880516 15.0
2013-02-06 0.002687 9.870968 -2.873655 7.0
2013-02-07 -0.309326 10.064516 0.244810 10.0
2013-02-08 0.222932 10.290323 -3.513254 7.0
2013-02-09 -0.015110 10.193548 -2.178438 8.0
2013-02-10 -0.589638 10.129032 0.460605 10.0
2013-02-11 -0.284855 10.096774 3.188080 13.0
2013-02-12 -0.321562 9.806452 3.515110 13.0
2013-02-13 0.191230 9.903226 -2.094456 8.0
2013-02-14 0.368971 9.967742 -0.336713 10.0
2013-02-15 -0.144894 10.193548 0.951346 11.0
2013-02-16 0.717926 10.129032 -0.846958 10.0
2013-02-17 0.803021 10.064516 0.132463 11.0
2013-02-18 -0.530127 9.870968 -4.340841 5.0
2013-02-19 1.103911 9.806452 -0.910362 10.0
In [14]:
## Seasonal Decomposition with quarterly period

from statsmodels.tsa.seasonal import seasonal_decompose
# Dictionary to store decomposition results
decomposed_data = {}

# Perform seasonal decomposition for each variable
for column in df_1.columns[:5]:
    decomposed = seasonal_decompose(df_1[column], model='additive', period=31) 
    decomposed_data[column] = decomposed

# Set up the subplot grid
rows = 5 * 3  # 3 rows per variable (Trend, Seasonal, Residual)
cols = 1  # Only one column

# Create subplots
fig = make_subplots(
    rows=rows, cols=cols, shared_xaxes=False,
    vertical_spacing=0.02,
    subplot_titles=[f"{column} - {component}" for column in df_1.columns[ : 5] for component in ['Trend', 'Seasonal', 'Residual']]
)

# Add decomposed components to subplots
for var_idx, column in enumerate(df_1.columns[ : 5]):
    decomposed = decomposed_data[column]
    
    # Calculate the starting row for each variable
    start_row = var_idx * 3 + 1
    
    # Original time series
    #fig.add_trace(go.Scatter(x=df_1.index, y=df_1[column], mode='lines', name=f"{column}"),row=start_row, col=1)
    
    
    # Trend component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.trend, mode='lines', name=f"{column} Trend"),
                  row=start_row, col=1)
    
    # Seasonal component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.seasonal, mode='lines', name=f"{column} Seasonal"),
                  row=start_row + 1, col=1)
    
    # Residual component
    fig.add_trace(go.Scatter(x=df_1.index, y=decomposed.resid, mode='lines', name=f"{column} Residual"),
                  row=start_row + 2, col=1)

# Customize layout
fig.update_layout(
    title="Seasonal Decomposition of Multiple Items of Store 1",
    height=700 * len(df_1.columns[ : 5]),  # Adjust height based on number of variables
    template="plotly_dark",
    showlegend=False
)

# Customize y-axis titles for each subplot
for var_idx, column in enumerate(df_1.columns[ : 5]):
    start_row = var_idx * 3 + 1
    #fig.update_yaxes(title_text="column", row=start_row, col=1)
    fig.update_yaxes(title_text="Trend", row=start_row, col=1)
    fig.update_yaxes(title_text="Seasonal", row=start_row + 1, col=1)
    fig.update_yaxes(title_text="Residual", row=start_row + 2, col=1)

# Show the figure
fig.show()
In [29]:
df_reconstructed = pd.concat([decomposed.seasonal, decomposed.trend, decomposed.resid, decomposed.observed], axis = 1)
df_reconstructed.columns = ['seasonal', 'trend', 'resid', 'actual_values']
df_reconstructed.head(50)
Out[29]:
seasonal trend resid actual_values
date
2013-01-01 0.000462 NaN NaN 11.0
2013-01-02 -0.553486 NaN NaN 6.0
2013-01-03 0.340841 NaN NaN 8.0
2013-01-04 -0.675288 NaN NaN 9.0
2013-01-05 0.474322 NaN NaN 8.0
2013-01-06 0.002687 NaN NaN 13.0
2013-01-07 -0.309326 NaN NaN 11.0
2013-01-08 0.222932 NaN NaN 10.0
2013-01-09 -0.015110 NaN NaN 7.0
2013-01-10 -0.589638 NaN NaN 8.0
2013-01-11 -0.284855 NaN NaN 10.0
2013-01-12 -0.321562 NaN NaN 8.0
2013-01-13 0.191230 NaN NaN 11.0
2013-01-14 0.368971 NaN NaN 5.0
2013-01-15 -0.144894 NaN NaN 10.0
2013-01-16 0.717926 8.903226 -3.621152 6.0
2013-01-17 0.803021 8.806452 -2.609472 7.0
2013-01-18 -0.530127 9.129032 1.401095 10.0
2013-01-19 1.103911 9.225806 -2.329717 8.0
2013-01-20 -0.275956 9.193548 0.082407 9.0
2013-01-21 -0.033464 9.419355 -5.385891 4.0
2013-01-22 0.241842 9.225806 -0.467648 9.0
2013-01-23 0.237948 9.193548 -3.431497 6.0
2013-01-24 0.530496 9.096774 2.372730 12.0
2013-01-25 -0.748703 9.129032 -3.380329 5.0
2013-01-26 -0.460049 9.193548 -1.733499 7.0
2013-01-27 -0.114666 9.290323 8.824343 18.0
2013-01-28 0.376992 9.451613 -0.828605 9.0
2013-01-29 -0.698648 9.354839 0.343809 9.0
2013-01-30 -0.066835 9.516129 -0.449294 9.0
2013-01-31 0.209027 9.548387 3.242585 13.0
2013-02-01 0.000462 9.677419 -1.677882 8.0
2013-02-02 -0.553486 9.806452 6.747035 16.0
2013-02-03 0.340841 9.645161 1.013998 11.0
2013-02-04 -0.675288 9.709677 -1.034389 8.0
2013-02-05 0.474322 9.645161 4.880516 15.0
2013-02-06 0.002687 9.870968 -2.873655 7.0
2013-02-07 -0.309326 10.064516 0.244810 10.0
2013-02-08 0.222932 10.290323 -3.513254 7.0
2013-02-09 -0.015110 10.193548 -2.178438 8.0
2013-02-10 -0.589638 10.129032 0.460605 10.0
2013-02-11 -0.284855 10.096774 3.188080 13.0
2013-02-12 -0.321562 9.806452 3.515110 13.0
2013-02-13 0.191230 9.903226 -2.094456 8.0
2013-02-14 0.368971 9.967742 -0.336713 10.0
2013-02-15 -0.144894 10.193548 0.951346 11.0
2013-02-16 0.717926 10.129032 -0.846958 10.0
2013-02-17 0.803021 10.064516 0.132463 11.0
2013-02-18 -0.530127 9.870968 -4.340841 5.0
2013-02-19 1.103911 9.806452 -0.910362 10.0

To compute and plot the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) in Python with Plotly, you can use statsmodels to get the ACF and PACF values and then plotly.graph_objects to create the plots.

In [15]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.stattools import acf, pacf
In [16]:
# Calculate ACF and PACF
acf_values = acf(df_1["store1_item_1"], nlags=365)  # Adjust nlags as necessary
pacf_values = pacf(df_1["store1_item_1"], nlags=365)

# Create ACF plot
acf_fig = go.Figure()
acf_fig.add_trace(go.Scatter(x=np.arange(len(acf_values)), y=acf_values, name='ACF'))
acf_fig.update_layout(
    title="Autocorrelation Function (ACF)",
    xaxis_title="Lags",
    yaxis_title="ACF",
    template="plotly_white"
)
# Create PACF plot
pacf_fig = go.Figure()
pacf_fig.add_trace(go.Scatter(x=np.arange(len(pacf_values)), y=pacf_values, name='PACF'))
pacf_fig.update_layout(
    title="Partial Autocorrelation Function (PACF)",
    xaxis_title="Lags",
    yaxis_title="PACF",
    template="plotly_white"
)

# Show the plots
acf_fig.show()
pacf_fig.show()
In [17]:
def create_corr_plot(series, plot_pacf=False):
    corr_array = pacf(series.dropna(), alpha=0.05) if plot_pacf else acf(series.dropna(), alpha=0.05)
    lower_y = corr_array[1][:,0] - corr_array[0]
    upper_y = corr_array[1][:,1] - corr_array[0]

    fig = go.Figure()
    [fig.add_scatter(x=(x,x), y=(0,corr_array[0][x]), mode='lines',line_color='#3f3f3f') 
     for x in range(len(corr_array[0]))]
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers', marker_color='#1f77b4',
                   marker_size=12)
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
            fill='tonexty', line_color='rgba(255,255,255,0)')
    fig.update_traces(showlegend=False)
    fig.update_xaxes(range=[-1,42])
    fig.update_yaxes(zerolinecolor='#000000')
    
    title='Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
    fig.update_layout(title=title)
    fig.show()
In [18]:
create_corr_plot(df_1["store1_item_1"], plot_pacf=False)
In [19]:
def create_corr_plot(series, plot_pacf=False, nlags=None):
    corr_array = pacf(series.dropna(), alpha=0.05,nlags=nlags) if plot_pacf else acf(series.dropna(), alpha=0.05,nlags=nlags)
    lower_y = corr_array[1][:,0] - corr_array[0]
    upper_y = corr_array[1][:,1] - corr_array[0]

    fig = go.Figure()
    [fig.add_scatter(x=(x,x), y=(0,corr_array[0][x]), mode='lines',line_color='#3f3f3f') 
     for x in range(len(corr_array[0]))]
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers', marker_color='#1f77b4',
                   marker_size=6)
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
            fill='tonexty', line_color='rgba(255,255,255,0)')
    fig.update_traces(showlegend=False)
    fig.update_xaxes(range=[-1, nlags + 1])
    fig.update_yaxes(zerolinecolor='#000000')
    
    title='Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
    fig.update_layout(title=title)
    fig.show()
In [20]:
create_corr_plot(df_1["store1_item_1"], plot_pacf=False,nlags=93)
In [21]:
create_corr_plot(df_1["store1_item_1"], plot_pacf=True,nlags=93)
In [22]:
create_corr_plot(df_1["store1_item_1"], plot_pacf=True,nlags=180)

Looks like AR(7) would be suitable for this time series. But we need to check the stationarity.

In [23]:
from statsmodels.tsa.stattools import adfuller
# Perform the ADF test
result = adfuller(df_1["store1_item_1"])

# Extract and display the test results
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:")
for key, value in result[4].items():
    print(f"   {key}: {value}")

# Interpret the p-value
if result[1] < 0.05:
    print("The time series is likely stationary (reject H0).")
else:
    print("The time series is likely non-stationary (fail to reject H0).")
ADF Statistic: -3.157670556332814
p-value: 0.022569380626570913
Critical Values:
   1%: -3.4339840952648695
   5%: -2.8631452508003057
   10%: -2.567624583142913
The time series is likely stationary (reject H0).

Check any other item of store 1¶

In [24]:
create_corr_plot(df_1["store1_item_5"], plot_pacf=False,nlags=93)
In [25]:
create_corr_plot(df_1["store1_item_2"], plot_pacf=True,nlags=93)
In [26]:
from statsmodels.tsa.stattools import adfuller
# Perform the ADF test
result = adfuller(df_1["store1_item_2"])

# Extract and display the test results
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:")
for key, value in result[4].items():
    print(f"   {key}: {value}")

# Interpret the p-value
if result[1] < 0.05:
    print("The time series is likely stationary (reject H0).")
else:
    print("The time series is likely non-stationary (fail to reject H0).")
ADF Statistic: -3.163241216645516
p-value: 0.0222134781646119
Critical Values:
   1%: -3.4339840952648695
   5%: -2.8631452508003057
   10%: -2.567624583142913
The time series is likely stationary (reject H0).

Lets fit AR(7) model.¶

In [27]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
In [30]:
# Convert to pandas Series
data_series = pd.Series(df_1["store1_item_1"])
                             # Split the data into train and test sets
train_size = int(len(data_series) * 0.8)
train, test = data_series[:train_size], data_series[train_size:]
# Fit an AutoRegressive (AR) model
# Choose an appropriate lag value `p`. For example, let's use p=5
model = AutoReg(train, lags=7).fit()

# Generate predictions
predictions = model.predict(start=len(train), end=len(data_series)-1, dynamic=False)   
# Calculate error (optional)
error = mean_squared_error(test, predictions)
print(f"Test MSE: {error}")

# Plot the observed and predicted values
fig = go.Figure()
                             
# Plot observed values
fig.add_trace(go.Scatter(x=data_series.index, y=data_series, mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test.index, y=predictions, mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series",
    xaxis_title="Time",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
                             
                             
import warnings
warnings.filterwarnings("ignore", category=ValueWarning)                         
                             
                             
                             
                             
Test MSE: 54.081146963782
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[30], line 37
     33 fig.show()
     36 import warnings
---> 37 warnings.filterwarnings("ignore", category=ValueWarning)

NameError: name 'ValueWarning' is not defined

Use Prophet to predict the time series.¶

$Facebook Prophet$ (now simply known as "Prophet") is an open-source forecasting tool developed by Facebook's Core Data Science team. Designed for time series forecasting, it is especially useful for handling data with strong seasonal effects and multiple seasonalities, such as daily, weekly, and yearly trends. Prophet is known for its flexibility in automatically detecting holidays and special events, making it suitable for real-world business scenarios with irregular patterns. With its intuitive interface, Prophet enables analysts and data scientists to quickly build robust forecasting models without deep knowledge of complex statistical modeling. It is implemented in both Python and R, allowing for seamless integration with popular data science workflows.

Here are the main default parameters in Facebook Prophet that influence model behavior and flexibility:

Growth: 'linear'

Determines the type of trend. Defaults to 'linear', but can be set to 'logistic' for saturating growth. Seasonality:

Yearly Seasonality: Enabled by default with a fourier_order of 10.

Weekly Seasonality: Enabled by default with a fourier_order of 3.

Daily Seasonality: Disabled by default but can be enabled with a fourier_order of 4.

Holidays: Disabled by default

Users can specify holiday effects by providing a list of holiday dates.

Changepoint Range: 0.8

The proportion of data (from the start) in which Prophet will place potential changepoints. Default is 0.8, meaning changepoints are considered in the first 80% of the data. Changepoint Prior Scale: 0.05

Controls the flexibility of the trend. Lower values (e.g., 0.01) create a smoother trend, while higher values (e.g., 0.5) allow more flexibility. Interval Width: 0.80

Defines the uncertainty interval for forecasts. By default, it produces an 80% prediction interval.

Uncertainty Samples: 1000

The number of simulations Prophet runs to estimate forecast uncertainty.

Seasonality Mode: 'additive'

Can be set to 'additive' or 'multiplicative', depending on the data's seasonal behavior. These default settings make Prophet a versatile tool for common forecasting tasks, but they can be fine-tuned to capture more complex time series behavior.

In [33]:
## Prophet with  default seasonality( 365 days)

import numpy as np
import pandas as pd
from prophet import Prophet
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error,mean_absolute_error, mean_absolute_percentage_error



# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})

# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(interval_width=0.95) #by default is 80%
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")

# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
09:46:45 - cmdstanpy - INFO - Chain [1] start processing
09:46:45 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 25.119804306628804
Test MAPE: 0.22637318281731522
In [32]:
# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df[train_size:]['ds'], y=df[train_size:]['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()

Improving a Prophet model for better time series forecasting involves several strategies, including tuning model parameters, adding additional regressors, adjusting seasonalities, and making use of domain-specific knowledge. Here are some effective approaches:

1. Tune the Growth Model¶

Prophet can handle linear and logistic growth models. By default, it uses linear growth, but if your data shows saturation effects, switching to a logistic growth model with a defined capacity can improve accuracy. Example:

model = Prophet(growth='logistic')

df['cap'] = 100 # Set the upper limit (capacity)

model.fit(df)

2. Add Seasonality Components¶

By default, Prophet detects yearly seasonality (for datasets with sufficient data) and weekly seasonality. However, you can manually add additional seasonal components (e.g., quarterly or custom seasonalities) if your data exhibits more complex patterns.

model = Prophet()

model.add_seasonality(name='quarterly', period=90.5, fourier_order=8) # Adjust period and fourier_order as needed

model.fit(df)

3. Use Additional Regressors¶

Prophet allows for adding external regressors that may influence your forecast. For example, adding information on holidays, special events, or external factors like temperature, promotions, or economic indicators can improve accuracy.

df['regressor'] = external_data # external_data is an array or series of additional data

model = Prophet()

model.add_regressor('regressor')

model.fit(df)

4. Incorporate Holiday Effects¶

Prophet has a built-in feature to account for holiday effects, which can be especially useful if the time series is affected by significant events or holidays. Prophet supports predefined holiday lists for various countries.

from prophet.make_holidays import make_holidays_df

model = Prophet(holidays=make_holidays_df('US', start='2020-01-01', end='2023-01-01'))

model.fit(df)

5. Adjust Changepoints and Flexibility¶

Changepoints allow the model to detect shifts in trend. Prophet automatically selects changepoints, but sometimes fine-tuning them helps. You can increase n_changepoints or adjust changepoint_prior_scale to allow more or less flexibility in trend shifts.

model = Prophet(changepoint_prior_scale=0.05) # Higher value increases flexibility

model.fit(df)

6. Increase Fourier Order for Seasonalities¶

Increasing the fourier_order for seasonalities allows the model to capture more complex seasonal patterns, but it also increases the risk of overfitting. Experiment with higher values if your seasonality has more intricate patterns.

model = Prophet(yearly_seasonality=False)

model.add_seasonality(name='yearly', period=365.25, fourier_order=20) # Custom yearly seasonality with higher complexity

model.fit(df)

7. Tune Seasonality and Trend Flexibility Using Cross-Validation¶

Prophet has a built-in cross-validation function that can help you assess the model’s performance with different tuning parameters and identify the best configuration.

from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(model, initial='730 days', period='180 days', horizon='365 days')

df_p = performance_metrics(df_cv)

print(df_p.head())

8. Increase Forecast Accuracy with Hyperparameter Tuning¶

Systematic tuning of changepoint_prior_scale, seasonality_prior_scale, holidays_prior_scale, and fourier_order for seasonalities can help find the best model. You could automate this using a search method (e.g., grid search) or use libraries like optuna for optimization.

9. Ensure Quality of Data¶

Prophet is sensitive to outliers and missing data, which can negatively impact model performance. Handle any missing values, outliers, or irregular frequency before modeling. By carefully tuning the Prophet model using these strategies, you can often capture more of the inherent patterns in your data and improve forecast accuracy.

1. Tune the Growth Model¶

cap=55

In [34]:
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(growth='logistic')
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
09:51:36 - cmdstanpy - INFO - Chain [1] start processing
09:51:36 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.17361803171227
Test MAPE: 0.23425785594156023

2. Add Seasonality Components¶

In [35]:
#### period=365, fourier_order=8,growth='logistic'#####
#####################################
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(growth='logistic')

model= model.add_seasonality(name="yearly", period=365, fourier_order=8) # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")

# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
09:52:02 - cmdstanpy - INFO - Chain [1] start processing
09:52:02 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.156016782911937
Test MAPE: 0.23506594627394364
In [36]:
# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df[train_size:]['ds'], y=df[train_size:]['y'], mode='lines', name='Observed'))
# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
In [46]:
#### Quarterly period=93, fourier_order=8,growth='logistic'#####
#####################################
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(growth='logistic')

model= model.add_seasonality(name="quarterly", period=93, fourier_order=8) # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")

# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
10:08:52 - cmdstanpy - INFO - Chain [1] start processing
10:08:52 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 27.130378392293093
Test MAPE: 0.23715099114755028
In [50]:
#### Monthly period=31, fourier_order=5,growth='logistic'#####
#####################################
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(growth='logistic')

model= model.add_seasonality(name="monthly", period=31, fourier_order=5) # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")

# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
10:11:17 - cmdstanpy - INFO - Chain [1] start processing
10:11:17 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.39701071627797
Test MAPE: 0.23503769651039932
In [51]:
#### Weekly period=7, fourier_order=5,growth='logistic'#####
#####################################
# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
df['cap'] = 50
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(growth='logistic')

model= model.add_seasonality(name="weekly", period=7, fourier_order=5) # Adjust period and fourier_order as needed
model.add_country_holidays("US")
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")

# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
10:12:08 - cmdstanpy - INFO - Chain [1] start processing
10:12:08 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 26.13574676458783
Test MAPE: 0.23392506720077097
In [ ]:
 
In [52]:
## Multiplicative is not good.########################
###########################################################

# Create a DataFrame with columns 'ds' and 'y' for Prophet
df = pd.DataFrame({"ds": df_1.index, "y":df_1["store1_item_1"]})
## Creating model parameters
model_param ={
    "daily_seasonality": False,
    "weekly_seasonality":True,
    "yearly_seasonality":False,
    "seasonality_mode": "multiplicative",
    "growth": "logistic"
}
df['cap'] = df["y"].max() + df["y"].std() * 0.05
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]

# Initialize and fit the Prophet model
model = Prophet(**model_param)
model.fit(train_df)

# Make future predictions
future = model.make_future_dataframe(periods=len(test_df), freq="D")
future['cap'] = 50
forecast = model.predict(future)

# Calculate the MSE for the test set
test_forecast = forecast[-len(test_df):]
error1 = mean_squared_error(test_df['y'], test_forecast['yhat'])
error2=mean_absolute_percentage_error(test_df['y'], test_forecast['yhat'])
print(f"Test MSE: {error1}")
print(f"Test MAPE: {error2}")
# Plot observed and predicted values
fig = go.Figure()

# Plot observed (train + test) values
fig.add_trace(go.Scatter(x=df['ds'], y=df['y'], mode='lines', name='Observed'))

# Plot predicted values
fig.add_trace(go.Scatter(x=test_forecast['ds'], y=test_forecast['yhat'], mode='lines', name='Predicted'))

# Update plot layout
fig.update_layout(
    title="Observed vs Predicted Time Series (Prophet)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white"
)

fig.show()
10:12:52 - cmdstanpy - INFO - Chain [1] start processing
10:12:52 - cmdstanpy - INFO - Chain [1] done processing
Test MSE: 41.926852277876336
Test MAPE: 0.29944857764618943

Conclusion¶

Even though the time series data appears stationary, and the ACF and PACF plots suggest selecting an AR(7) model, FB Prophet outperforms the AR models. Prophet’s flexibility in capturing seasonality, holidays, and trend changes enables it to deliver more accurate forecasts in this case, making it a powerful alternative to traditional AR models for time series forecasting.

In [ ]: